Extracting Features from Textual Data in Class Imbalance Problems
Authors
Abstract
We address class imbalance problems. These are classification problems where the target variable is binary and one class dominates over the other. A central objective in these problems is to identify features that yield models with high precision/recall values, the standard yardsticks for assessing such models. Our features are extracted from the textual data inherent in such problems. We use n-gram frequencies as features and introduce a discrepancy score that measures the efficacy of an n-gram in highlighting the minority class. The frequency counts of the n-grams with the highest discrepancy scores are used as features to construct models with the desired metrics. According to the best practices followed by the services industry, many customer support tickets will get audited and tagged as “contract-compliant”, whereas some will be tagged as “over-delivered”. Based on in-field data, we use a random forest classifier and perform a randomized grid search over the model hyperparameters. The model scoring is performed using a scoring function. Our objective is to minimize the follow-up costs by optimizing the recall score while maintaining a base-level precision score. The final optimized model achieves an acceptable recall score while staying above the base-level precision. We validate our feature selection method by comparing our model with one constructed using frequency counts of n-grams chosen randomly. We propose extensions of our feature extraction method to general classification (binary and multi-class) and regression problems. The discrepancy score is one measure of dissimilarity of distributions, and other (more general) measures that we formulate could potentially yield more effective models.
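The feature selection idea in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the formula for the discrepancy score, so the score used here (the share of minority-class documents containing an n-gram minus the share of majority-class documents containing it) is only an assumption, as are the function names, the whitespace tokenizer, and the toy ticket texts.

```python
from collections import Counter


def ngrams(text, n=2):
    """Split text on whitespace and return its word n-grams as strings."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def discrepancy_scores(docs, labels, n=2, minority=1):
    """Score each n-gram by how much more often it occurs in minority documents.

    ASSUMPTION: the score is the difference in document-frequency shares
    between the two classes. The paper's actual definition may differ.
    """
    n_min = sum(1 for y in labels if y == minority)
    n_maj = len(labels) - n_min
    if n_min == 0 or n_maj == 0:
        raise ValueError("both classes must be present")
    minority_df, majority_df = Counter(), Counter()
    for text, y in zip(docs, labels):
        target = minority_df if y == minority else majority_df
        target.update(set(ngrams(text, n)))  # count each n-gram once per document
    vocab = set(minority_df) | set(majority_df)
    return {g: minority_df[g] / n_min - majority_df[g] / n_maj for g in vocab}


def top_ngrams(scores, k):
    """Return the k highest-scoring n-grams, ties broken alphabetically."""
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [g for g, _ in ranked[:k]]


def featurize(text, selected, n=2):
    """Turn a document into frequency counts of the selected n-grams."""
    counts = Counter(ngrams(text, n))
    return [counts[g] for g in selected]


# Hypothetical toy tickets: label 1 marks the minority ("over-delivered") class.
docs = [
    "reset password for user",
    "extra work done for free",
    "reset password again",
    "extra work done on site",
]
labels = [0, 1, 0, 1]
selected = top_ngrams(discrepancy_scores(docs, labels), k=2)
features = [featurize(d, selected) for d in docs]
```

The resulting count vectors could then be passed to a classifier such as a random forest, with hyperparameters tuned by randomized search as the abstract describes.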
Similar resources
Extracting Information from Citeseer’s Textual Data
This article deals with CiteSeer, a free online digital library and search engine of mainly computer science research papers. First, it discusses CiteSeer’s features and structure and then it presents what useful information on publications and author collaborations can be extracted from its textual data. We show the basic properties of both the publication citation and author citation graph. M...
Extracting Predictor Variables to Construct Breast Cancer Survivability Model with Class Imbalance Problem
Application of data mining methods as a decision support system has a great benefit to predict survival of new patients. It also has a great potential for health researchers to investigate the relationship between risk factors and cancer survival. But due to the imbalanced nature of datasets associated with breast cancer survival, the accuracy of survival prognosis models is a challenging issue...
Extracting Coactivated Features from Multiple Data Sets
We present a nonlinear generalization of Canonical Correlation Analysis (CCA) to find related structure in multiple data sets. The new method allows to analyze an arbitrary number of data sets, and the extracted features capture higher-order statistical dependencies. The features are independent components that are coupled across the data sets. The coupling takes the form of coactivation (depen...
Breast Cancer Diagnosis from Perspective of Class Imbalance
Introduction: Breast cancer is the second cause of mortality among women. Early detection is the only rescue to reduce the risk of breast cancer mortality. Traditional methods cannot effectively diagnose tumors since they are based on the assumption of a well-balanced dataset. However, a hybrid method can help to alleviate the two-class imbalance problem existing in the ...
Class Imbalance Problem in Data Mining Review
In the last few years there have been major changes and evolution in the classification of data. As the application area of technology increases, the size of data also increases. Classification of data becomes difficult because of the unbounded size and imbalanced nature of data. The class imbalance problem has become the greatest issue in data mining. The imbalance problem occurs where one of the two classes havi...
Journal
Journal title: Journal of computer-assisted linguistic research
Year: 2022
ISSN: 2530-9455
DOI: https://doi.org/10.4995/jclr.2022.18200